Project Case Scenario: Boston Real Estate¶
Overview¶
- Programming language: Python
- Tools: NumPy, pandas, SciPy, statsmodels, Matplotlib, Seaborn, and Plotly Express in Jupyter Notebook
- Objective: To provide insight into the Boston, Massachusetts real estate market and answer the following questions regarding the community.
- How would you visualize the median value of owner-occupied homes using two different methods of plotting?
- How would you visualize the number of houses bounded and not bounded by the Charles River with two different methods of plotting? Is there a significant difference in the median value of these houses?
- Is there a difference in the median values of houses for each proportion of owner-occupied units built before 1940?
- Is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?
- What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?
- Dataset: housing information and prices derived from the U.S. Census Service.
- Dataset variables:
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 square feet
- INDUS: proportion of non-retail business acres per town
- CHAS: whether the tract bounds the Charles River (1 if tract bounds river, 0 if the tract does not)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- LSTAT: percentage lower status of the population
- MEDV: Median value of owner-occupied homes in units of $1,000
- This project was initially completed as part of Coursera's IBM "Statistics for Data Science with Python" course. I independently rewrote, refined, and extended it with further code, analyses, and visualizations.
Table of Contents¶
- Set Up
- Data Exploration
- Objective #1
- Visualize the median value of owner-occupied homes using two different methods of plotting.
- Objective #2
- Visualize the number of houses bounded and not bounded by the Charles River with two different methods of plotting. Is there a significant difference in the median value of these houses?
- Objective #3
- Is there a difference in the median values of houses for each proportion of owner-occupied units built before 1940?
- Objective #4
- Is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?
- Objective #5
- What is the impact of an additional weighted distance to the five Boston employment centres on the median value of owner-occupied homes?
- Summary
# Suppress warnings:
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
# Import libraries
import numpy as np
import pandas as pd
import seaborn as sns
import scipy.stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd
import plotly.express as px
import plotly.graph_objects as go
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
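Replacing `warnings.warn` silences every warning for the rest of the session. A narrower alternative, sketched here with a hypothetical `noisy_sum` helper that is not part of this notebook, is to ignore a single warning category inside a `with` block using the standard library's `warnings.catch_warnings`.

```python
import warnings

def noisy_sum(values):
    """Hypothetical helper that emits a FutureWarning before summing."""
    warnings.warn("noisy_sum may change in a future release", FutureWarning)
    return sum(values)

# Ignore only FutureWarning, and only inside this block
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=FutureWarning)
    total = noisy_sum([1, 2, 3])

print(total)
```

Outside the `with` block the previous warning filters are restored, so unrelated warnings remain visible.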
Import Data¶
Load the real estate dataset and drop the first column.
# Load data
boston_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/boston_housing.csv'
boston_df = pd.read_csv(boston_url)
boston_df.drop('Unnamed: 0', axis=1, inplace=True) #remove unused variable
Add a new variable to represent the full value of owner-occupied homes.
# Calculate variable for full value
boston_df["MEDV_units"] = boston_df["MEDV"]*1000
# Find the size of dataset
print("There are currently", boston_df.shape[0], "rows and", boston_df.shape[1], "variables in this dataset.")
There are currently 506 rows and 14 variables in this dataset.
# Load the first five rows of the dataset
boston_df.head()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | MEDV_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1.0 | 296.0 | 15.3 | 4.98 | 24.0 | 24000.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 9.14 | 21.6 | 21600.0 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 4.03 | 34.7 | 34700.0 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 2.94 | 33.4 | 33400.0 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 5.33 | 36.2 | 36200.0 |
# Find the variable types
boston_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   CRIM        506 non-null    float64
 1   ZN          506 non-null    float64
 2   INDUS       506 non-null    float64
 3   CHAS        506 non-null    float64
 4   NOX         506 non-null    float64
 5   RM          506 non-null    float64
 6   AGE         506 non-null    float64
 7   DIS         506 non-null    float64
 8   RAD         506 non-null    float64
 9   TAX         506 non-null    float64
 10  PTRATIO     506 non-null    float64
 11  LSTAT       506 non-null    float64
 12  MEDV        506 non-null    float64
 13  MEDV_units  506 non-null    float64
dtypes: float64(14)
memory usage: 55.5 KB
# Descriptive statistics for all variables
boston_df.describe()
| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | LSTAT | MEDV | MEDV_units |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 | 506.000000 |
| mean | 3.613524 | 11.363636 | 11.136779 | 0.069170 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 12.653063 | 22.532806 | 22532.806324 |
| std | 8.601545 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.105710 | 8.707259 | 168.537116 | 2.164946 | 7.141062 | 9.197104 | 9197.104087 |
| min | 0.006320 | 0.000000 | 0.460000 | 0.000000 | 0.385000 | 3.561000 | 2.900000 | 1.129600 | 1.000000 | 187.000000 | 12.600000 | 1.730000 | 5.000000 | 5000.000000 |
| 25% | 0.082045 | 0.000000 | 5.190000 | 0.000000 | 0.449000 | 5.885500 | 45.025000 | 2.100175 | 4.000000 | 279.000000 | 17.400000 | 6.950000 | 17.025000 | 17025.000000 |
| 50% | 0.256510 | 0.000000 | 9.690000 | 0.000000 | 0.538000 | 6.208500 | 77.500000 | 3.207450 | 5.000000 | 330.000000 | 19.050000 | 11.360000 | 21.200000 | 21200.000000 |
| 75% | 3.677083 | 12.500000 | 18.100000 | 0.000000 | 0.624000 | 6.623500 | 94.075000 | 5.188425 | 24.000000 | 666.000000 | 20.200000 | 16.955000 | 25.000000 | 25000.000000 |
| max | 88.976200 | 100.000000 | 27.740000 | 1.000000 | 0.871000 | 8.780000 | 100.000000 | 12.126500 | 24.000000 | 711.000000 | 22.000000 | 37.970000 | 50.000000 | 50000.000000 |
Objective #1: visualize the median value of owner-occupied homes using two different libraries for plotting.¶
Boxplots are created to assess the price distribution of owner-occupied homes. Boxplots will be plotted using two different Python libraries, Seaborn and Plotly.
Method #1: a static visualization with Seaborn¶
# Graphing a boxplot with Seaborn
Q1_fig = sns.boxplot(data=boston_df,
y="MEDV_units",
linecolor="black",
palette="Blues")
Q1_fig.set(ylabel = "Price", title ='Median Values of Owner-Occupied Homes')
Q1_fig.get_yaxis().set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}')) # format y-axis price display
plt.show()
Method #2: an interactive visualization with Plotly Express¶
# Graphing a boxplot with Plotly Express
Q1_fig2 = px.box(boston_df,
y="MEDV_units",
width = 500,
height = 600,
title = "Median Values of Owner-Occupied Homes",
labels={"MEDV_units": "Price"},
template="simple_white")
Q1_fig2.update_layout(title_x=0.5)
Q1_fig2.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q1_fig2.show()
From the boxplots, it seems like there is a large spread of house values, with the median value of houses appearing to be slightly over $20,000. Descriptive statistics can be calculated to provide additional information on the distribution of house values.
# Print descriptive statistics of house values
print("\nBasic descriptive statistics can be seen here:")
funcs = {'average': 'mean',
         'standard deviation': 'std',
         'median': 'median',
         'lowest': 'min',
         'highest': 'max'}
for key, value in funcs.items():
    print("\tThe ", key," value is ${:,.2f}".format(boston_df['MEDV_units'].aggregate(value)), ".", sep="")
# print descriptive statistics of house values in dataframe
print("\nAdditional descriptive statistics:\n")
# Name the column before styling so the label appears in the rendered table
MEDV_table = boston_df["MEDV_units"].describe().round(2).to_frame("Median Value of Homes")
MEDV_table.style.format('{:,.2f}')

Basic descriptive statistics can be seen here:
    The average value is $22,532.81.
    The standard deviation value is $9,197.10.
    The median value is $21,200.00.
    The lowest value is $5,000.00.
    The highest value is $50,000.00.

Additional descriptive statistics:

| | Median Value of Homes |
|---|---|
| count | 506.00 |
| mean | 22,532.81 |
| std | 9,197.10 |
| min | 5,000.00 |
| 25% | 17,025.00 |
| 50% | 21,200.00 |
| 75% | 25,000.00 |
| max | 50,000.00 |
From the boxplots and descriptive statistics, house values in this dataset range from $5,000.00 to $50,000.00, with a median of $21,200.00 and a mean of $22,532.81.
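The points drawn beyond the whiskers of a boxplot can be related back to these statistics. As a minimal sketch, assuming the conventional 1.5 × IQR rule (the default `whis` in Seaborn's `boxplot`), the fences can be worked out from the quartiles reported above. The variable names are illustrative, and a plotting library may compute quartiles slightly differently.

```python
# Quartiles and extremes copied from the descriptive statistics of MEDV_units
q1 = 17_025.00
q3 = 25_000.00
minimum = 5_000.00
maximum = 50_000.00

# Interquartile range and the 1.5 x IQR fences
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

print(f"IQR: ${iqr:,.2f}")
print(f"Lower fence: ${lower_fence:,.2f}")
print(f"Upper fence: ${upper_fence:,.2f}")

# Values beyond a fence would be drawn as individual outlier points
has_low_outliers = minimum < lower_fence
has_high_outliers = maximum > upper_fence
print("Outliers below the lower fence:", has_low_outliers)
print("Outliers above the upper fence:", has_high_outliers)
```

Under this rule both the minimum and the maximum fall outside the fences, so outlier points would be expected at both ends of the distribution.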
Objective #2: is there a significant difference in the median value of houses bounded by the Charles River?¶
Visualize real estate by the Charles River¶
Bar charts are first created to compare the number of houses bounding and not bounding the Charles River. We can rename the types of housing for ease of reading and calculate the total for each kind.
# Create a variable to display the binary values as strings
boston_df.loc[(boston_df['CHAS'] == 0), 'CHAS_STRING'] = 'House Does Not Bound River'
boston_df.loc[(boston_df['CHAS'] == 1), 'CHAS_STRING'] = 'House Bounds River'
# Create a dataframe for housing counts
CHAS_counts = boston_df.CHAS_STRING.value_counts().to_frame().reset_index()
print("\nNumber of houses that do not bound the Charles River:", CHAS_counts.iloc[0,1])
print("Number of houses that bound the Charles River:", CHAS_counts.iloc[1,1],"\n")

Number of houses that do not bound the Charles River: 471
Number of houses that bound the Charles River: 35
We can then graph the findings with two methods, one static and one interactive. Houses that do not bound the river are represented in orange while houses that bound the river are shown in blue.
Method #1: a static visualization with Matplotlib¶
# Graphing a bar chart with Matplotlib
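CHAS_value_counts = boston_df.CHAS_STRING.value_counts() # take labels and counts from the same Series so they stay aligned
plt.bar(CHAS_value_counts.index,
        CHAS_value_counts.values,
        color=['orange','blue'])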
plt.xlabel('\nType of House')
plt.ylabel('Count')
plt.title('Number of Houses by the Charles River')
plt.show()
Method #2: an interactive visualization with Plotly Express¶
# Graphing a bar chart with Plotly Express
Q2_fig2 = px.bar(CHAS_counts,
x="CHAS_STRING",
y="count",
width = 700,
height = 600,
color="CHAS_STRING",
title="Number of Houses by the Charles River",
labels={"CHAS_STRING":"Type of House",
"count":"Count"},
color_discrete_sequence=["orange", "blue"],
template="simple_white")
Q2_fig2.update_yaxes(range=[0, CHAS_counts.iloc[0,1].round(-2) ]) # automatically adjust
Q2_fig2.update_layout(title_x=0.5)
Q2_fig2.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q2_fig2.show()
From both these plots, we can see there are more houses that do not bound the Charles River than those that bound it.
A boxplot can be used to visualize any potential price differences between these different types of houses. As with the above bar chart, houses that are not located by the river are represented in orange while houses next to the river are shown in blue.
# Graphing a boxplot with Plotly Express
Q2_fig3 = px.box(boston_df,
x="CHAS_STRING",
y="MEDV_units",
width = 700,
height = 600,
title ='Housing Price Differences Based on Bounding the Charles River',
labels={
"MEDV_units": "Price",
"CHAS_STRING":"Type of House"},
color="CHAS_STRING",
color_discrete_sequence=["orange", "blue"],
template="simple_white")
Q2_fig3.update_layout(title_x=0.5)
Q2_fig3.update_layout(margin=dict(l=30, r=30, t=50, b=20))
Q2_fig3.show()
The descriptive statistics for each group's housing values are as follows:
# Descriptive statistics
CHAS_STRING_table = boston_df.groupby("CHAS_STRING")["MEDV_units"].describe().round(2).style.format('{:,.2f}')
CHAS_STRING_table.index.name = "Type of House"
CHAS_STRING_table
| Type of House | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| House Bounds River | 35.00 | 28,440.00 | 11,816.64 | 13,400.00 | 21,100.00 | 23,300.00 | 33,150.00 | 50,000.00 |
| House Does Not Bound River | 471.00 | 22,093.84 | 8,831.36 | 5,000.00 | 16,600.00 | 20,900.00 | 24,800.00 | 50,000.00 |
From the boxplot, houses that do not bound the Charles River appear to have lower values than houses that bound the river. The descriptive statistics support this: the mean value is $22,093.84 for houses that do not bound the river and $28,440.00 for houses that do.
Statistical analysis¶
A t-test for independent samples is conducted to determine whether there is a significant difference in the median value of houses depending on whether they bound the Charles River.
- Null hypothesis: there is no difference in the median value of houses that bound and do not bound the Charles River.
- Alternative hypothesis: there is a difference in the median value of houses that bound and do not bound the Charles River.
Levene's test is first conducted to assess whether the variances between groups are equal.
# Levene's test to evaluate whether variances between groups are equal
Levene_test = scipy.stats.levene(boston_df[boston_df['CHAS'] == 0]['MEDV_units'],
boston_df[boston_df['CHAS'] == 1]['MEDV_units'], center='mean')
print("\nLevene's test results:")
print("\t", Levene_test, sep="")
print("\nLevene's test results in a p-value of ", Levene_test[1].round(3),
", which suggests the variances between groups are not equal. As such, the t-test for independent samples will be conducted with unequal variances.\n", sep="")
Levene's test results:
    LeveneResult(statistic=8.751904896045989, pvalue=0.003238119367639829)

Levene's test results in a p-value of 0.003, which suggests the variances between groups are not equal. As such, the t-test for independent samples will be conducted with unequal variances.
# T-test
ttest_var = scipy.stats.ttest_ind(boston_df[boston_df['CHAS'] == 0]['MEDV_units'],
boston_df[boston_df['CHAS'] == 1]['MEDV_units'], equal_var = False)
print("\nResults of a t-test with unequal variances:")
print("\t", ttest_var, "\n", sep="")
if ttest_var[1] < 0.05: # compare the unrounded p-value; round only for display
    print("The t-test results in a p-value of ", ttest_var[1].round(3), ". Since the p-value is less than α = 0.05, we reject the null hypothesis.", sep="")
else:
    print("The t-test results in a p-value of ", ttest_var[1].round(3), ". Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis.", sep="")
Results of a t-test with unequal variances:
    TtestResult(statistic=-3.11329131279484, pvalue=0.0035671700981374896, df=36.87640879761199)

The t-test results in a p-value of 0.004. Since the p-value is less than α = 0.05, we reject the null hypothesis.
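The unequal-variance (Welch) statistic can be checked by hand from the group summaries in the descriptive statistics table. The sketch below recomputes the t statistic and the Welch-Satterthwaite degrees of freedom, then compares them with `scipy.stats.ttest_ind_from_stats`. The variable names are illustrative, and because the inputs are rounded to two decimals the results only approximate the output above.

```python
import math
from scipy import stats

# Group summaries copied from the descriptive statistics table (rounded)
n_off, mean_off, sd_off = 471, 22_093.84, 8_831.36   # does not bound the river
n_on, mean_on, sd_on = 35, 28_440.00, 11_816.64      # bounds the river

# Squared standard error contributed by each group
var_off = sd_off**2 / n_off
var_on = sd_on**2 / n_on

# Welch t statistic and Welch-Satterthwaite degrees of freedom
t_manual = (mean_off - mean_on) / math.sqrt(var_off + var_on)
df_manual = (var_off + var_on) ** 2 / (
    var_off**2 / (n_off - 1) + var_on**2 / (n_on - 1)
)

# The same test computed by SciPy directly from the summary statistics
result = stats.ttest_ind_from_stats(
    mean_off, sd_off, n_off,
    mean_on, sd_on, n_on,
    equal_var=False,
)

print(f"Manual t: {t_manual:.3f}, df: {df_manual:.2f}")
print(f"SciPy t: {result.statistic:.3f}, p-value: {result.pvalue:.4f}")
```

The negative sign reflects the order of the groups: houses that do not bound the river are passed first and have the lower mean.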
Conclusion¶
This test suggests there is a significant difference in the median value of houses bounded and not bounded by the Charles River.
Objective #3: is there a significant difference in median values of houses for each proportion of owner-occupied units built before 1940?¶
The following graph and analyses compare the median value of owner-occupied homes across three age groups. The AGE variable records the proportion of owner-occupied units built before 1940 and is treated here as the age of the houses, discretized into three groups (houses 35 years and younger, houses between 35 and 70 years, and houses that are 70 years and older).
The age of the units built before 1940 will first be partitioned into three groups for analysis.
# Discretize age groups
boston_df.loc[(boston_df['AGE'] <= 35), 'age_group'] = '35 Years and Younger'
boston_df.loc[(boston_df['AGE'] > 35)&(boston_df['AGE'] < 70), 'age_group'] = 'Between 35 and 70 Years'
boston_df.loc[(boston_df['AGE'] >= 70), 'age_group'] = '70 Years and Older'
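The same three-way split can be written in one step. The sketch below uses `np.select` with the same boundary conditions and stores the result as an ordered categorical, so that grouped output follows the youngest-to-oldest order rather than alphabetical order. The `demo` frame is made-up illustration data, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Small made-up sample of AGE values (not taken from the dataset)
demo = pd.DataFrame({"AGE": [2.9, 35.0, 35.1, 69.9, 70.0, 100.0]})

labels = ["35 Years and Younger", "Between 35 and 70 Years", "70 Years and Older"]

# Same boundaries as the .loc assignments: <= 35, strictly between, >= 70
conditions = [
    demo["AGE"] <= 35,
    (demo["AGE"] > 35) & (demo["AGE"] < 70),
    demo["AGE"] >= 70,
]

# Ordered categorical keeps the groups in youngest-to-oldest order
demo["age_group"] = pd.Categorical(
    np.select(conditions, labels, default="Unknown"),
    categories=labels,
    ordered=True,
)

print(demo)
```

Any value that matched none of the conditions would become missing in the categorical column, which makes gaps in the boundaries easy to spot.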
Visualize the median values of homes built before 1940 by their age¶
A boxplot is used to compare the price distribution of owner-occupied homes by their age.
# Plotting with Seaborn
sns.reset_orig()
plt.figure(figsize=(13, 10)) # set the figure size before drawing so it applies to this plot
Q3_fig1 = sns.boxplot(data=boston_df, x="age_group", y="MEDV_units", palette="Blues")
Q3_fig1.set(xlabel="\nProportion of Owner-Occupied Units by Home Age", ylabel = "Value", title ='Median Values of Owner-Occupied Homes By Age')
Q3_fig1.get_yaxis().set_major_formatter(mpl.ticker.StrMethodFormatter('{x:,.0f}'))
plt.show()
print("\nDescriptive statistics of each group's housing price are as follows:\n")
boston_df.groupby("age_group")["MEDV_units"].describe().round(2)
Descriptive statistics of each group's housing price are as follows:
| age_group | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 35 Years and Younger | 91.0 | 27775.82 | 7638.20 | 17100.0 | 23050.0 | 24800.0 | 31150.0 | 50000.0 |
| 70 Years and Older | 287.0 | 19793.38 | 9515.38 | 5000.0 | 13800.0 | 18200.0 | 22550.0 | 50000.0 |
| Between 35 and 70 Years | 128.0 | 24947.66 | 6969.37 | 10200.0 | 20675.0 | 22600.0 | 27425.0 | 50000.0 |
From the boxplot, it seems houses that are 70 years and older have relatively lower values, while houses that are 35 years and younger have relatively higher values. The descriptive statistics of each age group support this perceived price difference.
Statistical analysis¶
An ANOVA is conducted to determine whether there is a significant difference in the median price of houses for each proportion of owner-occupied units built before 1940.
- Null hypothesis: there is no difference in the median value of houses for each proportion of owner-occupied units built prior to 1940.
- Alternative hypothesis: there is a difference in the median value of houses for each proportion of owner-occupied units built prior to 1940.
# ANOVA
ANOVA_group1 = boston_df[boston_df['age_group'] == '35 Years and Younger']['MEDV_units']
ANOVA_group2 = boston_df[boston_df['age_group'] == 'Between 35 and 70 Years']['MEDV_units']
ANOVA_group3 = boston_df[boston_df['age_group'] == '70 Years and Older']['MEDV_units']
ANOVA_var = scipy.stats.f_oneway(ANOVA_group1, ANOVA_group2, ANOVA_group3)
print("\nANOVA results:")
print("\t",ANOVA_var,sep="")
print("\nAn ANOVA results in a p-value of {:0.3e}".format(ANOVA_var[1]), ". Since this p-value is less than α = 0.05, we reject the null hypothesis.\n", sep="")
ANOVA results:
    F_onewayResult(statistic=36.40764999196602, pvalue=1.7105011022701769e-15)

An ANOVA results in a p-value of 1.711e-15. Since this p-value is less than α = 0.05, we reject the null hypothesis.
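The F statistic can be reconstructed from the group counts, means, and standard deviations in the descriptive statistics table, which also yields an effect size. The sketch below is a one-way ANOVA computed from summary statistics. The names are illustrative, and the rounded inputs mean the values only approximate the output above.

```python
from scipy import stats

# (count, mean, standard deviation) per group, copied from the table (rounded)
groups = {
    "35 Years and Younger": (91, 27_775.82, 7_638.20),
    "Between 35 and 70 Years": (128, 24_947.66, 6_969.37),
    "70 Years and Older": (287, 19_793.38, 9_515.38),
}

n_total = sum(n for n, _, _ in groups.values())
grand_mean = sum(n * mean for n, mean, _ in groups.values()) / n_total

# Between-group and within-group sums of squares
ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups.values())
ss_within = sum((n - 1) * sd**2 for n, _, sd in groups.values())

df_between = len(groups) - 1
df_within = n_total - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
p_value = stats.f.sf(f_stat, df_between, df_within)

# Eta squared: share of total variance attributable to the grouping
eta_squared = ss_between / (ss_between + ss_within)

print(f"F({df_between}, {df_within}) = {f_stat:.2f}, p = {p_value:.3e}")
print(f"Eta squared = {eta_squared:.3f}")
```

Eta squared puts the significant result in context by showing how much of the variation in home values the three groups account for.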
Since the ANOVA above is significant, a post hoc analysis using Tukey's HSD test to correct for multiple comparisons is then conducted to evaluate whether differences between group means are significant.
# Run post hoc analysis to determine significantly different pairs
print(pairwise_tukeyhsd(boston_df['MEDV_units'], boston_df['age_group']))
Multiple Comparison of Means - Tukey HSD, FWER=0.05
============================================================================================
group1 group2 meandiff p-adj lower upper reject
--------------------------------------------------------------------------------------------
35 Years and Younger 70 Years and Older -7982.4444 0.0 -10418.1857 -5546.7031 True
35 Years and Younger Between 35 and 70 Years -2828.1679 0.0447 -5604.3202 -52.0157 True
70 Years and Older Between 35 and 70 Years 5154.2765 0.0 3002.3619 7306.191 True
--------------------------------------------------------------------------------------------
The results from the post hoc Tukey's HSD test suggest that the values of all three groups are significantly different from one another. The average median price of houses 35 years and younger is the highest of the three groups, followed by houses between 35 and 70 years. Houses that are 70 years and older had the lowest average median price of the three groups.
Conclusion¶
An ANOVA test suggests there is a significant difference in the median value of houses for each proportion of owner-occupied homes built prior to 1940. The average median price of houses 35 years and younger is the highest of the three groups while the average median price of houses 70 years and older is the lowest of the three groups. That is, houses differ in their value based on age, with younger houses tending to be more expensive than older houses.
Objective #4: is there a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town?¶
Assessing the dataset shows that many values of the "INDUS" and "NOX" variables are duplicated. This means that the same values for the proportion of non-retail business acres per town and nitric oxide concentrations were reported several times, which may distort the results if the dataset is used as is. To resolve this, rows with duplicate "INDUS" values are discarded so that only the first row for each unique "INDUS" value remains.
# Retain only unique values of "INDUS"
boston_df_no_INDUS_duplicates = boston_df.drop_duplicates(subset=["INDUS"])
# Find size of dataset
print('The new dataset includes', boston_df_no_INDUS_duplicates.shape[0], 'rows and', boston_df_no_INDUS_duplicates.shape[1], 'columns.')
The new dataset includes 76 rows and 16 columns.
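How `drop_duplicates` behaves depends on the `subset` argument. The sketch below uses made-up rows (not taken from the dataset) to show that deduplicating on "INDUS" alone keeps the first row for each value and discards any later "NOX" reading that shares it, whereas deduplicating on both columns keeps every distinct pair.

```python
import pandas as pd

# Made-up rows for illustration; not taken from the Boston dataset
demo = pd.DataFrame({
    "INDUS": [2.31, 7.07, 7.07, 2.18, 2.18],
    "NOX":   [0.538, 0.469, 0.469, 0.458, 0.460],
})

# Keep the first row for each INDUS value (keep="first" is the default)
unique_indus = demo.drop_duplicates(subset=["INDUS"])

# Keep the first row for each distinct (INDUS, NOX) pair
unique_pairs = demo.drop_duplicates(subset=["INDUS", "NOX"])

print(unique_indus)
print(unique_pairs)
```

In this toy frame the last row survives only in `unique_pairs`, because its "NOX" value differs from the earlier row with the same "INDUS" value.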
Visualize the relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.¶
A scatterplot is created to assess the relationship between the two variables.
# Plotting with Seaborn
sns.reset_orig()
Q4_fig = sns.scatterplot(x="INDUS", y="NOX", data=boston_df_no_INDUS_duplicates)
Q4_fig.set(xlabel="Proportion of Non-Retail Business Acres per Town",
ylabel = "Nitric Oxide Concentration\n(Parts per 10 Million)",
title ='Relationship Between Non-Retail Business Acres per Town and Nitric Oxide Concentrations')
plt.show()
There seems to be a positive, linear relationship between the proportion of non-retail business acres per town and the nitric oxide concentrations. Greater proportions of non-retail business acres seem to be related to greater nitric oxide concentrations.
Statistical analysis¶
A Pearson Correlation is conducted to determine whether there is a significant relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.
- Null hypothesis: there is no relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.
- Alternative hypothesis: there is a relationship between nitric oxide concentrations and the proportion of non-retail business acres per town.
# Pearson Correlation
pearsonr_var = scipy.stats.pearsonr(boston_df_no_INDUS_duplicates['INDUS'], boston_df_no_INDUS_duplicates['NOX'])
print("\nPearson Correlation results:")
print("\t",pearsonr_var,sep="")
print("\nA Pearson Correlation results in a p-value of {:0.3e}".format(pearsonr_var[1]), ".", sep="")
print("Since this p-value is less than α = 0.05, we reject the null hypothesis.\n")
Pearson Correlation results:
    PearsonRResult(statistic=0.6809525800688366, pvalue=1.3018316252583866e-11)

A Pearson Correlation results in a p-value of 1.302e-11. Since this p-value is less than α = 0.05, we reject the null hypothesis.
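The p-value of a Pearson correlation follows from the coefficient and the sample size alone. As a sketch, the correlation reported above (r ≈ 0.681 on the 76 deduplicated rows) can be converted to a t statistic with n − 2 degrees of freedom, and squaring r gives the share of variance the two variables have in common. The variable names are illustrative.

```python
import math
from scipy import stats

r = 0.6809525800688366  # correlation coefficient from the output above
n = 76                  # rows remaining after dropping duplicate INDUS values

# Under the null hypothesis, t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom
df = n - 2
t_stat = r * math.sqrt(df) / math.sqrt(1 - r**2)
p_value = 2 * stats.t.sf(abs(t_stat), df)

# Coefficient of determination: shared variance between INDUS and NOX
r_squared = r**2

print(f"t({df}) = {t_stat:.3f}, p = {p_value:.3e}")
print(f"r squared = {r_squared:.3f}")
```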
Conclusion¶
This test suggests there is a significant relationship between nitric oxide concentrations and the proportion of non-retail business acres per town. Nitric oxide concentrations and the proportion of non-retail business acres are positively correlated (r ≈ 0.68), with increases in one variable associated with increases in the other.
Objective #5: what is the impact of an additional weighted distance to the five Boston employment centers on the median value of owner-occupied homes?¶
Statistical analysis¶
A linear regression is conducted to determine whether there is a significant relationship between weighted distance to the Boston employment centers and median price of owner-occupied homes.
- Null hypothesis: there is no relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes.
- Alternative hypothesis: there is a relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes.
X = boston_df["DIS"]
y = boston_df["MEDV_units"]
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
print("\nRegression results:")
print(model.summary())
if model.pvalues["DIS"] < 0.05: # look up the coefficient by name and compare the unrounded p-value
    print("\nThe regression results in a p-value of {:0.3e}".format(model.pvalues["DIS"]),
          " for the independent variable. Since the p-value is less than α = 0.05, we reject the null hypothesis.", sep="")
else:
    print("\nThe regression results in a p-value of {:0.3e}".format(model.pvalues["DIS"]),
          " for the independent variable. Since the p-value is greater than α = 0.05, we fail to reject the null hypothesis.", sep="")
print("\nFor every additional 1 unit of weighted distance to the five Boston employment centers, the median house price increases on average by ${:0.2f}".format(model.params["DIS"]),
      ".\n",sep="")
Regression results:
OLS Regression Results
==============================================================================
Dep. Variable: MEDV_units R-squared: 0.062
Model: OLS Adj. R-squared: 0.061
Method: Least Squares F-statistic: 33.58
Date: Wed, 05 Mar 2025 Prob (F-statistic): 1.21e-08
Time: 03:18:35 Log-Likelihood: -5319.2
No. Observations: 506 AIC: 1.064e+04
Df Residuals: 504 BIC: 1.065e+04
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 1.839e+04 817.389 22.499 0.000 1.68e+04 2e+04
DIS 1091.6130 188.378 5.795 0.000 721.509 1461.717
==============================================================================
Omnibus: 139.779 Durbin-Watson: 0.570
Prob(Omnibus): 0.000 Jarque-Bera (JB): 305.104
Skew: 1.466 Prob(JB): 5.59e-67
Kurtosis: 5.424 Cond. No. 9.32
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The regression results in a p-value of 1.207e-08 for the independent variable. Since the p-value is less than α = 0.05, we reject the null hypothesis.
For every additional 1 unit of weighted distance to the five Boston employment centers, the median house price increases on average by $1091.61.
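The fitted line can be used directly to read off predicted values. The sketch below plugs the coefficients from the summary table into the regression equation. The `predict_medv` helper is a hypothetical name, and the intercept is only available as rounded in the summary (1.839e+04), so the predictions are approximate.

```python
# Coefficients copied from the regression summary above
intercept = 18_390.0   # "const", as rounded in the summary (1.839e+04)
slope = 1_091.6130     # "DIS" coefficient
slope_se = 188.378     # standard error of the "DIS" coefficient

def predict_medv(dis):
    """Predicted median home value (in dollars) for a weighted distance."""
    return intercept + slope * dis

# An OLS line passes through the point of means, so the prediction at the
# mean distance should land near the mean home value of $22,532.81
mean_dis = 3.795043
predicted_at_mean = predict_medv(mean_dis)

# The t statistic in the summary is the coefficient divided by its standard error
t_stat = slope / slope_se

print(f"Predicted value at mean DIS: ${predicted_at_mean:,.2f}")
print(f"t statistic for DIS: {t_stat:.3f}")
```

Moving one unit further out changes the prediction by exactly the slope, which is the interpretation given in the printed result.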
Conclusion¶
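This test suggests there is a significant relationship between the weighted distance to the five Boston employment centers and the median value of owner-occupied homes. Greater distances to the employment centers were associated with higher home values: each additional unit of weighted distance corresponds to an average increase of about $1,091.61 in median home value. Distance alone, however, explains only about 6% of the variance in home values (R-squared = 0.062).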
Summary¶
- The house values in this dataset have a large range ($5,000.00 to $50,000.00), with the average house valued at $22,532.81.
- There are fewer houses that bound the Charles River, but they are significantly more expensive than houses that are not on the river.
- The age of a house significantly impacts its price: older houses tend to be less valuable.
- There is a significant positive correlation between nitric oxide concentrations and the proportion of non-retail business acres per town, such that more business acres were associated with greater nitric oxide concentrations.
- There is a significant positive relationship between distance to the Boston employment centers and housing values: greater distances were related to higher house prices.
Notebook sources:
- IBM Corporation
- Coursera